SFUs (special function units)
-
Note that transcendental and related special-function operations (sin/cos, rsqrt) may be executed on SFUs, whose latency and throughput differ from those of the main ALUs.
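A minimal CUDA sketch of the point above, assuming an NVIDIA GPU: `sinf` is the accurate library routine, while the `__sinf` intrinsic is typically lowered to special-function hardware and trades accuracy for speed. The kernel name `sin_both` and the helper `max_sin_gap` are hypothetical names used only for illustration, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Each thread evaluates sin(x) twice: once with the accurate library call and
// once with the fast intrinsic (assumed to map to SFU instructions).
__global__ void sin_both(const float* x, float* precise, float* fast, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        precise[i] = sinf(x[i]);
        fast[i] = __sinf(x[i]);
    }
}

// Hypothetical helper: largest |sinf(x) - __sinf(x)| over n inputs in [0, 1).
float max_sin_gap(int n) {
    std::vector<float> h_x(n), h_p(n), h_f(n);
    for (int i = 0; i < n; ++i) h_x[i] = static_cast<float>(i) / n;

    size_t bytes = n * sizeof(float);
    float *d_x, *d_p, *d_f;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_p, bytes);
    cudaMalloc(&d_f, bytes);
    cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice);

    sin_both<<<(n + 255) / 256, 256>>>(d_x, d_p, d_f, n);

    cudaMemcpy(h_p.data(), d_p, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_f.data(), d_f, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_p);
    cudaFree(d_f);

    float gap = 0.0f;
    for (int i = 0; i < n; ++i) gap = std::fmax(gap, std::fabs(h_p[i] - h_f[i]));
    return gap;
}
```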
Tensor/Matrix cores, Ray-tracing cores
-
Mention specialized units for matrix multiply-accumulate (tensor/matrix cores) and ray traversal (ray-tracing cores), which change the performance characteristics of algorithms that can use them.
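As one illustration of the matrix case, here is a sketch using CUDA's WMMA API, in which a single warp computes one 16x16x16 multiply-accumulate tile. It assumes an NVIDIA GPU of compute capability 7.0 or newer and compilation with a flag such as `nvcc -arch=sm_70`. The kernel `wmma_tile` and the helper `wmma_ones_result` are hypothetical names, and error checking is omitted. Ray-tracing cores are not shown because they are reached through separate APIs rather than plain CUDA C++.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
#include <vector>

using namespace nvcuda;

// One warp computes C = A * B for a single 16x16 tile.
// A is row-major, B is column-major, both half precision; C accumulates in float.
__global__ void wmma_tile(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

// Hypothetical helper: multiplies two all-ones 16x16 matrices and returns C[0][0].
float wmma_ones_result() {
    std::vector<half> h_a(256, __float2half(1.0f)), h_b(256, __float2half(1.0f));
    std::vector<float> h_c(256, 0.0f);

    half *d_a, *d_b;
    float* d_c;
    cudaMalloc(&d_a, 256 * sizeof(half));
    cudaMalloc(&d_b, 256 * sizeof(half));
    cudaMalloc(&d_c, 256 * sizeof(float));
    cudaMemcpy(d_a, h_a.data(), 256 * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), 256 * sizeof(half), cudaMemcpyHostToDevice);

    wmma_tile<<<1, 32>>>(d_a, d_b, d_c);  // exactly one warp

    cudaMemcpy(h_c.data(), d_c, 256 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return h_c[0];  // each entry is a dot product of 16 ones
}
```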
Asynchronous copies / DMA engines
-
Add async copy mechanisms (global memory to shared memory, or staged transfers through DMA engines) that allow memory transfers to overlap with compute, when the hardware supports them.
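The overlap described above can be sketched with CUDA's cooperative groups `memcpy_async`, shown here as a double-buffered loop in which the copy of the next tile is in flight while the current tile is processed. This is a sketch under stated assumptions: one thread block of `TILE` threads, an input whose length is a multiple of `TILE`, and no error checking. The names `scale_tiles`, `scale_tiles_sum`, and `TILE` are hypothetical. On NVIDIA GPUs the copy is hardware-accelerated from compute capability 8.0; on older devices the same code still runs but falls back to an ordinary copy.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

constexpr int TILE = 256;  // elements per tile; also the assumed block size

// A single block walks over n_tiles tiles, prefetching tile t+1 into the
// spare shared-memory buffer while it scales tile t.
__global__ void scale_tiles(const float* in, float* out, int n_tiles, float s) {
    __shared__ float buf[2][TILE];
    cg::thread_block block = cg::this_thread_block();

    // Start fetching the first tile.
    cg::memcpy_async(block, buf[0], in, sizeof(float) * TILE);

    for (int t = 0; t < n_tiles; ++t) {
        int cur = t & 1;
        if (t + 1 < n_tiles) {
            // Issue the next copy, then wait for all but that newest copy,
            // so tile t is ready while tile t+1 is still in flight.
            cg::memcpy_async(block, buf[cur ^ 1], in + (t + 1) * TILE,
                             sizeof(float) * TILE);
            cg::wait_prior<1>(block);
        } else {
            cg::wait(block);  // last tile: wait for everything
        }
        out[t * TILE + threadIdx.x] = s * buf[cur][threadIdx.x];
        block.sync();  // buf[cur] is reused as a copy target two tiles later
    }
}

// Hypothetical helper: scales an all-ones input by 2 and returns the output sum.
float scale_tiles_sum(int n_tiles) {
    int n = n_tiles * TILE;
    std::vector<float> h_in(n, 1.0f), h_out(n, 0.0f);

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scale_tiles<<<1, TILE>>>(d_in, d_out, n_tiles, 2.0f);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float sum = 0.0f;
    for (float v : h_out) sum += v;
    return sum;
}
```

The two-buffer layout is the smallest arrangement that lets a transfer and a compute step proceed at once; deeper pipelines follow the same pattern with more buffers.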